I Captured The Ghosts In The Machine (And Named It Prism)
Most distillation datasets are flat. They show you what the AI said. They do not show you what the AI thought about saying. They show you the destination. They hide the journey. I decided to capture the journey. I decided to name it Prism.
CompactAI-Prism is now a thing. Actually, two things. Maybe a third thing soon. I have been busy while Haiku-2 outputs pipe characters at me.
A prism takes a single beam of light and refracts it to reveal the full spectrum. This dataset takes a single AI response and reveals the spectrum of probability that existed at every step.
The Datasets
I have two datasets out now. A third is in the planning phase. My hard drive hates me. My GPU loves me. I am somewhere in the middle questioning my life choices.
| Dataset | Size | Teacher Model | Top-K |
|---|---|---|---|
| cAI-Prism-K50 | 247 MB | Qwen3.5-0.8B | 50 |
| cAI-Prism-B.5-K50 | 2.2 GB | Qwen3.5-2B | 50 |
The third one exists only in my dreams and a partially written Python script. It will be large. It will be expensive. It will be worth it. Probably.
What Does B Stand For
Good question. I needed something for the 2B model variant. K50 means Top-50 tokens. That is clear. B needed meaning. I considered my options.
B.5 it is. It sounds technical. It sounds intentional. It sounds like I planned this instead of naming things at 3 AM while waiting for training to finish. Which is exactly what happened.
The B.5 designation means it sits between the base K50 and whatever comes next. Maybe B.6. Maybe C. Maybe I run out of letters and start using Greek. Omega Prism has a nice ring to it.
The Concept
Normal distillation gives you the answer. Prism gives you the answer plus the top 50 alternatives for every single token. You see the chosen path. You also see the 50 paths not taken. You see the logprobs. You see the uncertainty.
The math is simple. Call the tokens per response x and the number of prompts y. The total stored data points come to x times y times 50. That is roughly 50 times more training signal per prompt than a standard dataset, which keeps only the chosen token.
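The x times y times 50 arithmetic is easy to sanity-check in a few lines. The token and prompt counts below are invented for illustration, not the real dataset stats:

```python
# Back-of-the-envelope density math for a Prism-style dataset.
TOP_K = 50

def prism_datapoints(tokens_per_response: int, num_prompts: int, k: int = TOP_K) -> int:
    """Total stored (token, alternative, logprob) data points: x * y * k."""
    return tokens_per_response * num_prompts * k

# Example: 400 tokens per response, 10,000 prompts.
standard = prism_datapoints(400, 10_000, k=1)  # plain dataset: chosen token only
prism = prism_datapoints(400, 10_000)          # top-50 per token

print(standard)           # 4,000,000
print(prism)              # 200,000,000
print(prism // standard)  # 50x denser
```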
Why Qwen 3.5
Not 2.5. Not 3.0. Specifically 3.5. The 0.8B and 2B variants from the Qwen3.5 family. They are small enough to run locally. They are smart enough to teach my models. They are open enough to share logits.
I could have used larger models. I could have used closed APIs. I did not. My wallet said no. My principles said no. My hard drive also said no but I ignored that one.
The Data Structure
It is JSONL. It is standard. It is easy to parse. It is also massive. Here is what a single entry looks like. Notice the token_logprobs array. Notice the top_k list inside each token. That is the gold.
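To make that structure concrete, here is a minimal sketch of a single entry. Only `token_logprobs` and `top_k` are named above; every other field name is my assumption, not the dataset's actual schema:

```python
import json

# Hypothetical single Prism entry. Only `token_logprobs` and `top_k`
# come from the article; the rest are assumed field names.
entry = {
    "prompt": "What is the capital of France?",
    "response": "Paris",
    "token_logprobs": [
        {
            "token": "Paris",
            "logprob": -0.12,
            # Top-K alternatives the teacher considered at this step.
            "top_k": [
                {"token": "Paris", "logprob": -0.12},
                {"token": "London", "logprob": -3.41},
                # ...48 more entries in the real data
            ],
        },
    ],
}

line = json.dumps(entry)  # one JSONL line
parsed = json.loads(line)
print(parsed["token_logprobs"][0]["top_k"][0]["token"])  # Paris
```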
The Full Vocab Dream
The third dataset plans to capture everything. Not just 50 tokens. Not just 100 tokens. The full vocabulary. Qwen3.5 has 248,320 tokens. I want to capture all of them for every generation step.
At full vocabulary, every generation step stores 248,320 logprobs instead of one. That makes it 248,320 times denser than a standard dataset and nearly 5,000 times denser than the K50 releases. This level of density maps the entire semantic landscape. Standard datasets show the path taken. Full vocab shows the terrain.
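Mechanically, full-vocab capture just means keeping everything after the log-softmax instead of truncating to K. A toy sketch over a five-token mock vocabulary (real logits would come from Qwen3.5's final layer; these numbers are invented):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a plain list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

# Mock logits for one generation step over a tiny 5-token vocabulary.
# A real step over Qwen3.5's 248,320-token vocab works identically.
logits = [4.0, 2.5, 1.0, -1.0, -3.0]
logprobs = log_softmax(logits)

# Top-K capture keeps only the K largest entries...
k = 2
top_k = sorted(enumerate(logprobs), key=lambda p: p[1], reverse=True)[:k]

# ...full-vocab capture keeps all of them.
full = list(enumerate(logprobs))

print(len(top_k), len(full))  # 2 5
# Full-vocab probabilities sum to exactly 1; a top-K slice sums to less.
print(round(sum(math.exp(lp) for _, lp in full), 6))  # 1.0
```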
Imagine knowing every road not taken. Most models learn the highway. This dataset teaches the backroads. It shows the model that "Why?" lives in a different neighborhood than "Hey! What's up?". The probability distance tells a story: a low teacher probability signals semantic distance from the chosen token, a high one signals similarity.
The student model learns the shape of language itself. It understands that certain responses belong together. It understands tone. It understands context. It sees the probability mass surrounding each decision. A model trained on this knows why "Research" fits better than "To" in a formal context. It sees the weight of every possibility.
Will it work? I hope so. Can my hard drive handle it? Probably not. Will I try anyway? Absolutely. A single response could become gigabytes. A dataset could become terabytes. I am aware of the scale. I am proceeding with caution. And hope. Mostly hope.
Why This Matters
Training on just the final tokens teaches mimicry. Training on the probability distribution teaches reasoning. The student model learns why "Paris" was chosen over "London". It learns why "Research" was chosen over "To". It learns the shape of the decision.
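One common way to train on these distributions is a soft cross-entropy against the teacher's top-K distribution instead of a one-hot target. This is my sketch of the general technique, not necessarily how Prism's author trains Haiku-2:

```python
import math

def soft_cross_entropy(teacher_logprobs, student_logprobs):
    """Cross-entropy of the student against the teacher's distribution.

    Both inputs map token -> logprob. Hard-label training is the special
    case where the teacher puts probability 1 on the chosen token.
    """
    loss = 0.0
    for token, t_lp in teacher_logprobs.items():
        t_p = math.exp(t_lp)                       # teacher probability
        s_lp = student_logprobs.get(token, -20.0)  # floor for unseen tokens
        loss -= t_p * s_lp
    return loss

teacher = {"Paris": math.log(0.90), "London": math.log(0.08), "Rome": math.log(0.02)}

confident = {"Paris": math.log(0.85), "London": math.log(0.10), "Rome": math.log(0.05)}
confused  = {"Paris": math.log(0.40), "London": math.log(0.35), "Rome": math.log(0.25)}

# A student that matches the teacher's shape gets a lower loss than one
# that merely ranks "Paris" first.
print(soft_cross_entropy(teacher, confident) < soft_cross_entropy(teacher, confused))  # True
```

Both students pick "Paris", so a hard-label loss barely tells them apart. The soft loss rewards matching the teacher's doubt as well as its answer, which is exactly the signal the logprobs carry.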
This is how you make small models smart. You do not just show them the answer. You show them the thinking. You show them the doubt. You show them the confidence. Prism does all of this.
What Comes Next
Both datasets are MIT licensed. Both are on Hugging Face. Both capture the top 50 tokens at every step. Both are free. Use them. Fork them. Train your tiny models on them. Make something better than my tiny models.
Haiku-2 will train on Prism. Sonnet-2 will train on Prism. Opus will train on Prism. Maybe they will learn to speak. Maybe they will learn to think. Maybe they will still output pipe characters. We will find out together.
Final Thoughts
Prism exists now. Two versions. A third dreaming of full vocab. 50 times denser than standard datasets. Longer training times. Smarter models. Full hard drives. Empty wallet. Full heart.
This is something real. This is something useful. This helps the little guys train better models without spending millions on API calls. That is the goal. That is the dream. That is Prism.